From the description of a Kaggle Machine Learning Challenge at https://www.kaggle.com/c/titanic
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
In this demo we will use MLDB to train a classifier to predict whether a passenger would have survived the Titanic disaster.
pymldb and other imports
In this demo, we will use pymldb to interact with the REST API: see the Using pymldb Tutorial for more details.
In [12]:
from pymldb import Connection
mldb = Connection("http://localhost")
# we'll also need these later!
import numpy as np
import pandas as pd, matplotlib.pyplot as plt, seaborn, ipywidgets
%matplotlib inline
See the Loading Data Tutorial guide for more details on how to get data into MLDB.
In [13]:
mldb.put('/v1/procedures/import_titanic_raw', {
    "type": "import.text",
    "params": {
        "dataFileUrl": "http://public.mldb.ai/titanic_train.csv",
        "outputDataset": "titanic_raw",
        "runOnCreation": True
    }
})
Out[13]:
See the Query API documentation for more details on SQL queries.
In [14]:
mldb.query("select * from titanic_raw limit 5")
Out[14]:
As a first step in the modelling process, it is often very useful to look at summary statistics to get a sense of the data. To do so, we will create a Procedure of type summary.statistics and store the results in a new dataset called titanic_summary_stats:
In [15]:
print(mldb.post("/v1/procedures", {
    "type": "summary.statistics",
    "params": {
        "inputData": "SELECT * FROM titanic_raw",
        "outputDataset": "titanic_summary_stats",
        "runOnCreation": True
    }
}))
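Under the hood, summary statistics like these are just per-column aggregates. As a rough illustration of the kind of numbers the summary.statistics procedure reports for a numeric column (a sketch with made-up Fare-like values, not MLDB's actual implementation):

```python
import numpy as np

def numeric_summary(values):
    """Per-column aggregates, roughly what a summary-statistics
    procedure reports for a numeric column (sketch only)."""
    v = np.asarray([x for x in values if x is not None], dtype=float)
    return {
        "data_type": "number",
        "num_not_null": len(v),
        "num_null": len(values) - len(v),
        "min": float(v.min()),
        "max": float(v.max()),
        "mean": float(v.mean()),
        "stddev": float(v.std(ddof=1)),
    }

# Hypothetical Fare-like values, including a missing entry
fares = [7.25, 71.28, 7.92, None, 53.10]
stats = numeric_summary(fares)
```

Null counts matter here: the Titanic data has missing values (notably Age), and a good classifier pipeline has to tolerate them.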
We can take a look at numerical columns:
In [16]:
mldb.query("""
    SELECT * EXCLUDING(value.most_frequent_items*)
    FROM titanic_summary_stats
    WHERE value.data_type='number'
""").transpose()
Out[16]:
We will create another Procedure of type classifier.experiment. The configuration parameter defines a Random Forest algorithm.
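As a reminder of what this configuration means: bagging trains each weak learner on a bootstrap resample of the data and combines their votes, and a Random Forest is bagging over randomized decision trees. A toy numpy sketch of the bagging idea, using depth-1 threshold "stumps" as stand-ins for the decision trees (illustrative only, not MLDB's implementation):

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy 1-D dataset: the true label is 1 exactly when the feature exceeds 0.5
X = rng.rand(200)
y = (X > 0.5).astype(int)

def train_stump(X, y):
    """Pick the threshold that best separates the labels (a depth-1 'tree')."""
    thresholds = np.linspace(0, 1, 21)
    accs = [np.mean((X > t) == y) for t in thresholds]
    return thresholds[int(np.argmax(accs))]

def bagged_predict(X_train, y_train, X_test, num_bags=10):
    """Bagging: fit each stump on a bootstrap resample, then average the votes."""
    votes = []
    for _ in range(num_bags):
        idx = rng.randint(0, len(X_train), len(X_train))  # bootstrap sample
        t = train_stump(X_train[idx], y_train[idx])
        votes.append((X_test > t).astype(int))
    return (np.mean(votes, axis=0) > 0.5).astype(int)

preds = bagged_predict(X, y, X)
accuracy = np.mean(preds == y)
```

In the real procedure below, random_feature_propn plays the "forest" role: each tree split only sees a random 30% of the features, which decorrelates the bagged trees.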
In [17]:
result = mldb.put('/v1/procedures/titanic_train_scorer', {
    "type": "classifier.experiment",
    "params": {
        "experimentName": "titanic",
        "inputData": """
            select
                {Sex, Age, Fare, Embarked, Parch, SibSp, Pclass} as features,
                label
            from titanic_raw
        """,
        "configuration": {
            "type": "bagging",
            "num_bags": 10,
            "validation_split": 0,
            "weak_learner": {
                "type": "decision_tree",
                "max_depth": 10,
                "random_feature_propn": 0.3
            }
        },
        "kfold": 3,
        "modelFileUrlPattern": "file://models/titanic.cls",
        "keepArtifacts": True,
        "outputAccuracyDataset": True,
        "runOnCreation": True
    }
})

auc = np.mean([x["resultsTest"]["auc"] for x in result.json()["status"]["firstRun"]["status"]["folds"]])
print("\nArea under ROC curve = %0.4f\n" % auc)
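The AUC reported above has a simple probabilistic reading: it is the probability that a randomly chosen survivor receives a higher score than a randomly chosen non-survivor. A small sketch computing AUC directly from that definition (toy scores and labels, not the model's actual output):

```python
import numpy as np

def auc_from_pairs(scores, labels):
    """AUC as P(score of a random positive > score of a random negative),
    counting ties as half a win."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / float(len(pos) * len(neg))

# Toy example: one negative outscores one positive, so 5 of 6 pairs are ordered correctly
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
labels = [1,   1,   0,   1,   0]
auc = auc_from_pairs(scores, labels)
```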
The procedure above created for us a Function of type classifier.
In [18]:
@ipywidgets.interact
def score(Age=[0, 80], Embarked=["C", "Q", "S"], Fare=[1, 100], Parch=[0, 8], Pclass=[1, 3],
          Sex=["male", "female"], SibSp=[0, 8]):
    return mldb.get('/v1/functions/titanic_scorer_0/application', input={"features": locals()})
In [19]:
test_results = mldb.query("select * from titanic_results_0 order by score desc")
test_results.head()
Out[19]:
Here's an interactive way to graphically explore the tradeoffs between the True Positive Rate and the False Positive Rate, using what's called a ROC curve.
NOTE: the interactive part of this demo only works if you're running this Notebook live, not if you're looking at a static copy on http://docs.mldb.ai. See the documentation for Running MLDB.
In [20]:
@ipywidgets.interact
def test_results_plot(threshold_index=[0, len(test_results) - 1]):
    row = test_results.iloc[threshold_index]
    cols = ["trueNegatives", "falsePositives", "falseNegatives", "truePositives"]
    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    test_results.plot(ax=ax1, x="falsePositiveRate", y="truePositiveRate",
                      legend=False, title="ROC Curve, threshold=%.4f" % row.score).set_ylabel('truePositiveRate')
    ax1.plot(row.falsePositiveRate, row.truePositiveRate, 'gs')
    ax2.pie(row[cols], labels=cols, autopct='%1.1f%%', startangle=90,
            colors=['lightskyblue', 'lightcoral', 'lightcoral', 'lightskyblue'])
    ax2.axis('equal')
    f.subplots_adjust(hspace=.75)
    plt.show()
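Each point on the ROC curve corresponds to one score threshold: lowering the threshold catches more true positives but also lets in more false positives. A minimal sketch of how the (FPR, TPR) pairs are derived from scores and labels (toy data, independent of the outputAccuracyDataset MLDB produced above):

```python
import numpy as np

def roc_points(scores, labels):
    """For each candidate threshold (descending), compute one ROC point:
    TPR = TP / all positives, FPR = FP / all negatives."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    n_pos = float(np.sum(labels == 1))
    n_neg = float(np.sum(labels == 0))
    points = []
    for t in sorted(set(scores.tolist()), reverse=True):
        pred = scores >= t                     # classify as positive above threshold
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        points.append((fp / n_neg, tp / n_pos))
    return points

scores = [0.9, 0.8, 0.7, 0.3, 0.2]
labels = [1,   1,   0,   1,   0]
pts = roc_points(scores, labels)
```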
Let's create a function of type classifier.explain to help us understand what's happening here.
In [21]:
mldb.put('/v1/functions/titanic_explainer', {
    "id": "titanic_explainer",
    "type": "classifier.explain",
    "params": {"modelFileUrl": "file://models/titanic.cls"}
})
Out[21]:
NOTE: the interactive part of this demo only works if you're running this Notebook live, not if you're looking at a static copy on http://docs.mldb.ai. See the documentation for Running MLDB.
In [22]:
@ipywidgets.interact
def sliders(Age=[0, 80], Embarked=["C", "Q", "S"], Fare=[1, 100], Parch=[0, 8], Pclass=[1, 3],
            Sex=["male", "female"], SibSp=[0, 8]):
    features = locals()
    x = mldb.get('/v1/functions/titanic_explainer/application',
                 input={"features": features, "label": 1}).json()["output"]
    df = pd.DataFrame(
        {"%s=%s" % (feat, str(features[feat])): val for (feat, (val, ts)) in x["explanation"]},
        index=["val"]).transpose().cumsum()
    pd.DataFrame(
        {"cumulative score": [x["bias"]] + list(df.val) + [df.val[-1]]},
        index=['bias'] + list(df.index) + ['final']
    ).plot(kind='line', drawstyle='steps-post', legend=False, figsize=(15, 5),
           ylim=(-1, 1), title="Score = %.4f" % df.val[-1]).set_ylabel('Cumulative Score')
    plt.show()
When we sum up the explanation values in the context of the correct label, we can get an indication of how important each feature was to making a correct classification.
In [23]:
df = mldb.query("""
    select label, sum(
        titanic_explainer({
            label: label,
            features: {Sex, Age, Fare, Embarked, Parch, SibSp, Pclass}
        })[explanation]
    ) as *
    from titanic_raw group by label
""")
df.set_index("label").transpose().plot(kind='bar', title="Feature Importance", figsize=(15, 5))
plt.xticks(rotation=0)
plt.show()
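The key invariant behind these explanation plots is that the model's bias plus the per-feature explanation values reconstructs the classifier's score, so each value can be read as that feature's signed contribution. A toy sketch of that bookkeeping with hypothetical contributions (not values from the actual model):

```python
# Hypothetical bias and per-feature contributions for one passenger
bias = 0.05
explanation = {"Sex": 0.40, "Pclass": 0.15, "Age": -0.10, "Fare": 0.08}

# The score is reconstructed as bias + sum of contributions ...
score = bias + sum(explanation.values())

# ... and the step plot drawn by the sliders above is just the running total
running, waterfall = bias, [("bias", bias)]
for feat, contrib in explanation.items():
    running += contrib
    waterfall.append((feat, running))
```

Here a positive contribution (e.g. Sex) pushes the passenger toward "survived", a negative one (e.g. Age) pushes the other way, and the last running total equals the score.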
In [24]:
mldb.put('/v1/plugins/pytanic', {
    "type": "python",
    "params": {"address": "git://github.com/datacratic/mldb-pytanic-plugin"}
})
Out[24]:
Now you can browse to the plugin UI.
NOTE: this only works if you're running this Notebook live, not if you're looking at a static copy on http://docs.mldb.ai. See the documentation for Running MLDB.
Check out the other Tutorials and Demos.